{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 题目：对活动进行分类"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "解题提示：\n",
    "文件说明：\n",
    "1. 可以先运行0. EDA.ipynb，看一下竞赛所有数据的情况；\n",
    "2. 总体活动的数目太多（300w+记录），可以只需对训练集train.csv和测试集test.cv出现的活动（13418条记录）举行聚类即可。运行1. Users_Events.ipynb可得到只在训练集train.csv和测试集test.cv出现的活动，可自己修改代码存为csv格式，在进行聚类。\n",
    "\n",
    "作业要求：\n",
    "1. 抽取出只在训练集和测试集中出现的event：20分\n",
    "2. 聚类 ：40分\n",
    "3. CH_scores计算：20分\n",
    "4. 结果显示/分析：20分"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 数据探索"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "#保存数据\n",
    "import pickle\n",
    "\n",
    "import itertools\n",
    "\n",
    "#处理事件字符串\n",
    "import datetime\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scipy.io as sio\n",
    "import scipy.sparse as ss\n",
    "\n",
    "#相似度/距离\n",
    "import scipy.spatial.distance as ssd\n",
    "\n",
    "from collections import defaultdict\n",
    "from sklearn.preprocessing import normalize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user</th>\n",
       "      <th>event</th>\n",
       "      <th>invited</th>\n",
       "      <th>timestamp</th>\n",
       "      <th>interested</th>\n",
       "      <th>not_interested</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3044012</td>\n",
       "      <td>1918771225</td>\n",
       "      <td>0</td>\n",
       "      <td>2012-10-02 15:53:05.754000+00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3044012</td>\n",
       "      <td>1502284248</td>\n",
       "      <td>0</td>\n",
       "      <td>2012-10-02 15:53:05.754000+00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3044012</td>\n",
       "      <td>2529072432</td>\n",
       "      <td>0</td>\n",
       "      <td>2012-10-02 15:53:05.754000+00:00</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3044012</td>\n",
       "      <td>3072478280</td>\n",
       "      <td>0</td>\n",
       "      <td>2012-10-02 15:53:05.754000+00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3044012</td>\n",
       "      <td>1390707377</td>\n",
       "      <td>0</td>\n",
       "      <td>2012-10-02 15:53:05.754000+00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      user       event  invited                         timestamp  interested  \\\n",
       "0  3044012  1918771225        0  2012-10-02 15:53:05.754000+00:00           0   \n",
       "1  3044012  1502284248        0  2012-10-02 15:53:05.754000+00:00           0   \n",
       "2  3044012  2529072432        0  2012-10-02 15:53:05.754000+00:00           1   \n",
       "3  3044012  3072478280        0  2012-10-02 15:53:05.754000+00:00           0   \n",
       "4  3044012  1390707377        0  2012-10-02 15:53:05.754000+00:00           0   \n",
       "\n",
       "   not_interested  \n",
       "0               0  \n",
       "1               0  \n",
       "2               0  \n",
       "3               0  \n",
       "4               0  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    " \"\"\"\n",
    "train.csv 有6列：\n",
    "user：用户ID\n",
    "event：活动ID\n",
    "invited：是否被邀请（0/1）\n",
    "timestamp：ISO-8601 UTC格式时间字符串，表示用户看到该活动的时间\n",
    "interested, and not_interested\n",
    "\n",
    "Test.csv 除了没有interested, and not_interested，其余列与train相同\n",
    " \"\"\"\n",
    "\n",
    "#读取数据\n",
    "train = pd.read_csv(\"train.csv\")\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 15398 entries, 0 to 15397\n",
      "Data columns (total 6 columns):\n",
      "user              15398 non-null int64\n",
      "event             15398 non-null int64\n",
      "invited           15398 non-null int64\n",
      "timestamp         15398 non-null object\n",
      "interested        15398 non-null int64\n",
      "not_interested    15398 non-null int64\n",
      "dtypes: int64(5), object(1)\n",
      "memory usage: 721.9+ KB\n"
     ]
    }
   ],
   "source": [
    "train.info()"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "查看一下测试数据，好像没什么特别的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user</th>\n",
       "      <th>event</th>\n",
       "      <th>invited</th>\n",
       "      <th>timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1776192</td>\n",
       "      <td>2877501688</td>\n",
       "      <td>0</td>\n",
       "      <td>2012-11-30 11:39:01.230000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1776192</td>\n",
       "      <td>3025444328</td>\n",
       "      <td>0</td>\n",
       "      <td>2012-11-30 11:39:01.230000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1776192</td>\n",
       "      <td>4078218285</td>\n",
       "      <td>0</td>\n",
       "      <td>2012-11-30 11:39:01.230000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1776192</td>\n",
       "      <td>1024025121</td>\n",
       "      <td>0</td>\n",
       "      <td>2012-11-30 11:39:01.230000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1776192</td>\n",
       "      <td>2972428928</td>\n",
       "      <td>0</td>\n",
       "      <td>2012-11-30 11:39:21.985000+00:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      user       event  invited                         timestamp\n",
       "0  1776192  2877501688        0  2012-11-30 11:39:01.230000+00:00\n",
       "1  1776192  3025444328        0  2012-11-30 11:39:01.230000+00:00\n",
       "2  1776192  4078218285        0  2012-11-30 11:39:01.230000+00:00\n",
       "3  1776192  1024025121        0  2012-11-30 11:39:01.230000+00:00\n",
       "4  1776192  2972428928        0  2012-11-30 11:39:21.985000+00:00"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    " \"\"\"\n",
    "test同.csv 有4列：\n",
    "user：用户ID\n",
    "event：事件ID\n",
    "invited：是否被邀请（0/1）\n",
    "timestamp：ISO-8601 UTC格式时间字符串，表示用户看到该事件的时间\n",
    "interested, and not_interested\n",
    "\n",
    "Test.csv 除了没有interested, and not_interested，其余列与train相同\n",
    " \"\"\"\n",
    "\n",
    "#读取数据\n",
    "test = pd.read_csv(\"test.csv\")\n",
    "test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 10237 entries, 0 to 10236\n",
      "Data columns (total 4 columns):\n",
      "user         10237 non-null int64\n",
      "event        10237 non-null int64\n",
      "invited      10237 non-null int64\n",
      "timestamp    10237 non-null object\n",
      "dtypes: int64(3), object(1)\n",
      "memory usage: 320.0+ KB\n"
     ]
    }
   ],
   "source": [
    "test.info()"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "再看一下测试数据，好像也没有什么特别"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>event_id</th>\n",
       "      <th>user_id</th>\n",
       "      <th>start_time</th>\n",
       "      <th>city</th>\n",
       "      <th>state</th>\n",
       "      <th>zip</th>\n",
       "      <th>country</th>\n",
       "      <th>lat</th>\n",
       "      <th>lng</th>\n",
       "      <th>c_1</th>\n",
       "      <th>...</th>\n",
       "      <th>c_92</th>\n",
       "      <th>c_93</th>\n",
       "      <th>c_94</th>\n",
       "      <th>c_95</th>\n",
       "      <th>c_96</th>\n",
       "      <th>c_97</th>\n",
       "      <th>c_98</th>\n",
       "      <th>c_99</th>\n",
       "      <th>c_100</th>\n",
       "      <th>c_other</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>684921758</td>\n",
       "      <td>3647864012</td>\n",
       "      <td>2012-10-31T00:00:00.001Z</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>244999119</td>\n",
       "      <td>3476440521</td>\n",
       "      <td>2012-11-03T00:00:00.001Z</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3928440935</td>\n",
       "      <td>517514445</td>\n",
       "      <td>2012-11-05T00:00:00.001Z</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2582345152</td>\n",
       "      <td>781585781</td>\n",
       "      <td>2012-10-30T00:00:00.001Z</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1051165850</td>\n",
       "      <td>1016098580</td>\n",
       "      <td>2012-09-27T00:00:00.001Z</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 110 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     event_id     user_id                start_time city state  zip country  \\\n",
       "0   684921758  3647864012  2012-10-31T00:00:00.001Z  NaN   NaN  NaN     NaN   \n",
       "1   244999119  3476440521  2012-11-03T00:00:00.001Z  NaN   NaN  NaN     NaN   \n",
       "2  3928440935   517514445  2012-11-05T00:00:00.001Z  NaN   NaN  NaN     NaN   \n",
       "3  2582345152   781585781  2012-10-30T00:00:00.001Z  NaN   NaN  NaN     NaN   \n",
       "4  1051165850  1016098580  2012-09-27T00:00:00.001Z  NaN   NaN  NaN     NaN   \n",
       "\n",
       "   lat  lng  c_1   ...     c_92  c_93  c_94  c_95  c_96  c_97  c_98  c_99  \\\n",
       "0  NaN  NaN    2   ...        0     1     0     0     0     0     0     0   \n",
       "1  NaN  NaN    2   ...        0     0     0     0     0     0     0     0   \n",
       "2  NaN  NaN    0   ...        0     0     0     0     0     0     0     0   \n",
       "3  NaN  NaN    1   ...        0     0     0     0     0     0     0     0   \n",
       "4  NaN  NaN    1   ...        0     0     0     0     0     0     0     0   \n",
       "\n",
       "   c_100  c_other  \n",
       "0      0        9  \n",
       "1      0        7  \n",
       "2      0       12  \n",
       "3      0        8  \n",
       "4      0        9  \n",
       "\n",
       "[5 rows x 110 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"\"\"\n",
    "活动描述信息在events.csv文件：共110维特征\n",
    "前9列：event_id, user_id, start_time, city, state, zip, country, lat, and lng.\n",
    "event_id：id of the event, \n",
    "user_id：id of the user who created the event.  \n",
    "city, state, zip, and country： more details about the location of the venue (if known).\n",
    "lat and lng： floats（latitude and longitude coordinates of the venue）\n",
    "start_time： 字符串，ISO-8601 UTC time，表示活动开始时间\n",
    "\n",
    "后101列为词频：count_1, count_2, ..., count_100，count_other\n",
    "count_N：活动描述出现第N个词的次数\n",
    "count_other：除了最常用的100个词之外的其余词出现的次数\n",
    " \"\"\"\n",
    "\n",
    "#读取数据\n",
    "events = pd.read_csv(\"events.csv\")\n",
    "events.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 3137972 entries, 0 to 3137971\n",
      "Columns: 110 entries, event_id to c_other\n",
      "dtypes: float64(2), int64(103), object(5)\n",
      "memory usage: 2.6+ GB\n"
     ]
    }
   ],
   "source": [
    "events.info()"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "看一下活动数据，数据量真的很大啊，真的连统计信息也没有了"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>event</th>\n",
       "      <th>yes</th>\n",
       "      <th>maybe</th>\n",
       "      <th>invited</th>\n",
       "      <th>no</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1159822043</td>\n",
       "      <td>1975964455 252302513 4226086795 3805886383 142...</td>\n",
       "      <td>2733420590 517546982 1350834692 532087573 5831...</td>\n",
       "      <td>1723091036 3795873583 4109144917 3560622906 31...</td>\n",
       "      <td>3575574655 1077296663</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>686467261</td>\n",
       "      <td>2394228942 2686116898 1056558062 3792942231 41...</td>\n",
       "      <td>1498184352 645689144 3770076778 331335845 4239...</td>\n",
       "      <td>1788073374 733302094 1830571649 676508092 7081...</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1186208412</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3320380166 3810793697</td>\n",
       "      <td>1379121209 440668682</td>\n",
       "      <td>1728988561 2950720854</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2621578336</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>855842686</td>\n",
       "      <td>2406118796 3550897984 294255260 1125817077 109...</td>\n",
       "      <td>2671721559 1761448345 2356975806 2666669465 10...</td>\n",
       "      <td>1518670705 880919237 2326414227 2673818347 332...</td>\n",
       "      <td>3500235232</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        event                                                yes  \\\n",
       "0  1159822043  1975964455 252302513 4226086795 3805886383 142...   \n",
       "1   686467261  2394228942 2686116898 1056558062 3792942231 41...   \n",
       "2  1186208412                                                NaN   \n",
       "3  2621578336                                                NaN   \n",
       "4   855842686  2406118796 3550897984 294255260 1125817077 109...   \n",
       "\n",
       "                                               maybe  \\\n",
       "0  2733420590 517546982 1350834692 532087573 5831...   \n",
       "1  1498184352 645689144 3770076778 331335845 4239...   \n",
       "2                              3320380166 3810793697   \n",
       "3                                                NaN   \n",
       "4  2671721559 1761448345 2356975806 2666669465 10...   \n",
       "\n",
       "                                             invited                     no  \n",
       "0  1723091036 3795873583 4109144917 3560622906 31...  3575574655 1077296663  \n",
       "1  1788073374 733302094 1830571649 676508092 7081...                    NaN  \n",
       "2                               1379121209 440668682  1728988561 2950720854  \n",
       "3                                                NaN                    NaN  \n",
       "4  1518670705 880919237 2326414227 2673818347 332...             3500235232  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    " \"\"\"\n",
    "event_attendees.csv文件：共5维特征\n",
    "event_id：活动ID\n",
    "yes, maybe, invited, and no：以空格隔开的用户列表，\n",
    "分别表示该活动参加的用户、可能参加的用户，被邀请的用户和不参加的用户.\n",
    " \"\"\"\n",
    "\n",
    "#读取数据\n",
    "event_attendees = pd.read_csv(\"event_attendees.csv\")\n",
    "event_attendees.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 24144 entries, 0 to 24143\n",
      "Data columns (total 5 columns):\n",
      "event      24144 non-null int64\n",
      "yes        22160 non-null object\n",
      "maybe      20977 non-null object\n",
      "invited    22322 non-null object\n",
      "no         17485 non-null object\n",
      "dtypes: int64(1), object(4)\n",
      "memory usage: 943.2+ KB\n"
     ]
    }
   ],
   "source": [
    "event_attendees.info()"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "看一下活动用户数据，发现数据好多缺失值啊"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user</th>\n",
       "      <th>friends</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3197468391</td>\n",
       "      <td>1346449342 3873244116 4226080662 1222907620 54...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3537982273</td>\n",
       "      <td>1491560444 395798035 2036380346 899375619 3534...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>823183725</td>\n",
       "      <td>1484954627 1950387873 1652977611 4185960823 42...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1872223848</td>\n",
       "      <td>83361640 723814682 557944478 1724049724 253059...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3429017717</td>\n",
       "      <td>4253303705 2130310957 1838389374 3928735761 71...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         user                                            friends\n",
       "0  3197468391  1346449342 3873244116 4226080662 1222907620 54...\n",
       "1  3537982273  1491560444 395798035 2036380346 899375619 3534...\n",
       "2   823183725  1484954627 1950387873 1652977611 4185960823 42...\n",
       "3  1872223848  83361640 723814682 557944478 1724049724 253059...\n",
       "4  3429017717  4253303705 2130310957 1838389374 3928735761 71..."
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    " \"\"\"\n",
    "user_friends.csv文件：共2维特征\n",
    "user：用户ID\n",
    "friends：以空格隔开的用户好友ID列表，\n",
    " \"\"\"\n",
    "\n",
    "#读取数据\n",
    "user_friends = pd.read_csv(\"user_friends.csv\")\n",
    "user_friends.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 38202 entries, 0 to 38201\n",
      "Data columns (total 2 columns):\n",
      "user       38202 non-null int64\n",
      "friends    38063 non-null object\n",
      "dtypes: int64(1), object(1)\n",
      "memory usage: 597.0+ KB\n"
     ]
    }
   ],
   "source": [
    "user_friends.info()"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "看一下用户好友数据，也是有很多缺失值啊"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "因为本周作业是聚类，所以暂时不对数据进行特征工程。只是简单对其进行了解，方便之后的相关操作。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 用户和活动关联关系处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "number of uniqueUsers :3391\n",
      "number of uniqueEvents :13418\n"
     ]
    }
   ],
   "source": [
    " \"\"\"\n",
    "我们只关心train和test中出现的user和event，因此重点处理这部分关联数据\n",
    "\n",
    "train.csv 有6列：\n",
    "user：用户ID\n",
    "event：活动ID\n",
    "invited：是否被邀请（0/1）\n",
    "timestamp：ISO-8601 UTC格式时间字符串，表示用户看到该活动的时间\n",
    "interested, and not_interested\n",
    "\n",
    "Test.csv 除了没有interested, and not_interested，其余列与train相同\n",
    " \"\"\"\n",
    "from enum import Enum\n",
    "# 统计训练集中有多少不同的用户的events\n",
    "uniqueUsers = set()\n",
    "uniqueEvents = set()\n",
    "\n",
    "#倒排表\n",
    "#统计每个用户参加的活动   / 每个活动参加的用户\n",
    "eventsForUser = defaultdict(set)\n",
    "usersForEvent = defaultdict(set)\n",
    "    \n",
    "for filename in [\"train.csv\", \"test.csv\"]:\n",
    "    f = open(filename, 'rb')\n",
    "    \n",
    "    #忽略第一行（列名字）\n",
    "   # s1 = s.split(','.encode(enc))\n",
    "    f.readline().strip().split(\",\".encode(encoding = \"utf-8\"))\n",
    "    \n",
    "    for line in f:    #对每条记录\n",
    "        cols = line.strip().split(\",\".encode(encoding = \"utf-8\"))\n",
    "        uniqueUsers.add(cols[0])   #第一列为用户ID\n",
    "        uniqueEvents.add(cols[1])   #第二列为活动ID\n",
    "        \n",
    "        eventsForUser[cols[0]].add(cols[1])    #该用户参加了这个活动\n",
    "        usersForEvent[cols[1]].add(cols[0])    #该活动被用户参加\n",
    "    f.close()\n",
    "\n",
    "\n",
    "n_uniqueUsers = len(uniqueUsers)\n",
    "n_uniqueEvents = len(uniqueEvents)\n",
    "#np.savetxt('usersForEvent.csv',usersForEvent, delimiter = ',')  \n",
    "\n",
    "print(\"number of uniqueUsers :%d\" % n_uniqueUsers)\n",
    "print(\"number of uniqueEvents :%d\" % n_uniqueEvents)\n",
    "\n",
    "#用户关系矩阵表，可用于后续LFM/SVD++处理的输入\n",
    "#这是一个稀疏矩阵，记录用户对活动感兴趣\n",
    "userEventScores = ss.dok_matrix((n_uniqueUsers, n_uniqueEvents))\n",
    "userIndex = dict()\n",
    "eventIndex = dict()\n",
    "\n",
    "#重新编码用户索引字典\n",
    "for i, u in enumerate(uniqueUsers):\n",
    "    userIndex[u] = i\n",
    "    \n",
    "#重新编码活动索引字典    \n",
    "for i, e in enumerate(uniqueEvents):\n",
    "    eventIndex[e] = i\n",
    "\n",
    "n_records = 0\n",
    "ftrain = open(\"train.csv\", 'rb')\n",
    "ftrain.readline()\n",
    "for line in ftrain:\n",
    "    cols = line.strip().split(\",\".encode(encoding = \"utf-8\"))\n",
    "    i = userIndex[cols[0]]  #用户\n",
    "    j = eventIndex[cols[1]] #活动\n",
    "\n",
    "ftrain.close()\n",
    "\n",
    "  \n",
    "##统计每个用户参加的活动，后续用于将用户朋友参加的活动影响到用户\n",
    "#pickle.dump(eventsForUser, open(\"PE_eventsForUser.csv\", 'wb'))\n",
    "##统计活动参加的用户\n",
    "#pickle.dump(usersForEvent, open(\"PE_usersForEvent.csv\", 'wb'))\n",
    "\n",
    "#保存用户-活动关系矩阵R，以备后用\n",
    "sio.mmwrite(\"PE_userEventScores\", userEventScores)\n",
    "\n",
    "\n",
    "#保存用户索引表\n",
    "#pickle.dump(userIndex, open(\"PE_userIndex.csv\", 'wb'))\n",
    "#保存活动索引表\n",
    "pickle.dump(eventIndex, open(\"PE_eventIndex.csv\", 'wb'))\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "训练集和测试集中出现的用户数目和事件数目远小于users.csv出现的用户数和events.csv出现的事件数\n",
    "保存的PE_eventIndex.csv为只在训练集train.csv和测试集test.cv出现的活动索引表，所以我们后续的聚类将对该索引表进行操作"
   ]
  },
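  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two counts can be cross-checked directly from the loaded tables. The following is a minimal sketch, assuming both tables have `user` and `event` columns as shown earlier; the helper `unique_counts` and the small demo frames are hypothetical and only illustrate the idea:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "def unique_counts(train_df, test_df):\n",
    "    # Stack the user/event columns of both tables and count distinct IDs\n",
    "    both = pd.concat([train_df[['user', 'event']], test_df[['user', 'event']]])\n",
    "    return both['user'].nunique(), both['event'].nunique()\n",
    "\n",
    "# Small hypothetical frames standing in for train and test\n",
    "train_demo = pd.DataFrame({'user': [1, 1, 2], 'event': [10, 11, 10]})\n",
    "test_demo = pd.DataFrame({'user': [2, 3], 'event': [12, 10]})\n",
    "print(unique_counts(train_demo, test_demo))\n",
    "```\n",
    "\n",
    "On the real data, `unique_counts(train, test)` should agree with the counts printed by the file-based loop."
   ]
  },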
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 对活动进行聚类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "#提取出需要的数据\n",
    "event = ss.dok_matrix((13418, 101))\n",
    "eventIndex = pickle.load(open(\"PE_eventIndex.csv\", 'rb'))\n",
    "fevents = open(\"events.csv\", 'rb')\n",
    "fevents.readline()\n",
    "#event = dict()\n",
    "for line in fevents:\n",
    "    cols = line.strip().split(\",\".encode(encoding = \"utf-8\"))\n",
    "    eventId = str(cols[0])\n",
    "    #i = eventIndex(cols[0]) #活动\n",
    "    \n",
    "    if eventId in eventIndex:  #在训练集或测试集中出现\n",
    "        i = eventIndex[eventId]\n",
    "    \n",
    "    for j in range(9, 110):\n",
    "            event[i, j-9] = cols[j]\n",
    "fevents.close()    \n",
    "#pickle.dump(event, open(\"PE_event.csv\", 'wb'))\n",
    "sio.mmwrite(\"event\", event)"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "把只在训练集train.csv和测试集test.cv出现的活动进行提取\n",
    "这里尝试了很多存为csv格式的方法，但是一直报错，所以无奈只能参考老师给的代码，存为mtx格式"
   ]
  },
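  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way the CSV export could work is sketched here with pandas chunked reading, so that the 3M-row events.csv never has to be held in memory at once. The helper name `extract_events`, the output file name and the tiny demo file are assumptions for illustration, not part of the instructor's code:\n",
    "\n",
    "```python\n",
    "import os\n",
    "import tempfile\n",
    "import pandas as pd\n",
    "\n",
    "def extract_events(events_path, wanted_ids, out_path, chunksize=100000):\n",
    "    # Copy the rows of events_path whose event_id is in wanted_ids to out_path\n",
    "    wanted = set(wanted_ids)\n",
    "    n_kept = 0\n",
    "    for k, chunk in enumerate(pd.read_csv(events_path, chunksize=chunksize)):\n",
    "        kept = chunk[chunk['event_id'].isin(wanted)]\n",
    "        # Write the header with the first chunk only, then append\n",
    "        kept.to_csv(out_path, mode='w' if k == 0 else 'a', header=(k == 0), index=False)\n",
    "        n_kept += len(kept)\n",
    "    return n_kept\n",
    "\n",
    "# Tiny demo file standing in for events.csv (hypothetical data)\n",
    "tmp_dir = tempfile.mkdtemp()\n",
    "src = os.path.join(tmp_dir, 'events_demo.csv')\n",
    "dst = os.path.join(tmp_dir, 'events_in_train_test.csv')\n",
    "demo = pd.DataFrame({'event_id': [1, 2, 3, 4], 'c_1': [0, 2, 1, 0], 'c_other': [9, 7, 12, 8]})\n",
    "demo.to_csv(src, index=False)\n",
    "\n",
    "n_kept = extract_events(src, wanted_ids=[2, 4], out_path=dst, chunksize=2)\n",
    "print(n_kept, pd.read_csv(dst)['event_id'].tolist())\n",
    "```\n",
    "\n",
    "With the notebook's data, `wanted_ids` could be `set(train.event) | set(test.event)`. The resulting CSV keeps all 110 columns, so the 101 word-count columns would still have to be selected before clustering."
   ]
  },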
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "#读取数据\n",
    "import scipy.io as sio\n",
    "eventContMatrix = sio.mmread(\"event\") \n",
    "#event = pickle.load(open(\"PE_event.csv\", 'rb'))\n",
    "#event.readline()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit a model for one hyperparameter value (K clusters) and score the clustering\n",
    "from sklearn.cluster import MiniBatchKMeans\n",
    "from sklearn import metrics\n",
    "def K_cluster_analysis(K, event):\n",
    "    print(\"K-means begin with clusters: {}\".format(K))\n",
    "\n",
    "    # Fit MiniBatchKMeans on the event features\n",
    "    mb_kmeans = MiniBatchKMeans(n_clusters = K)\n",
    "    mb_kmeans.fit(event)\n",
    "\n",
    "    # Assign every event to its nearest cluster centre\n",
    "    event_pred = mb_kmeans.predict(event)\n",
    "\n",
    "    # Criteria for choosing K: the silhouette coefficient and the\n",
    "    # Calinski-Harabasz index are common choices; for both, higher is better.\n",
    "    # NOTE: despite its name, CH_score below holds the silhouette coefficient.\n",
    "    # The Calinski-Harabasz index would instead come from\n",
    "    # metrics.calinski_harabasz_score (calinski_harabaz_score in older scikit-learn).\n",
    "    CH_score = metrics.silhouette_score(event, event_pred)\n",
    "\n",
    "    print(\"CH_score: {}\".format(CH_score))\n",
    "\n",
    "    return CH_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "K-means begin with clusters: 10\n",
      "CH_score: -0.03405413715837257\n",
      "K-means begin with clusters: 20\n",
      "CH_score: 0.11159483154674518\n",
      "K-means begin with clusters: 30\n",
      "CH_score: -0.08527052385421639\n",
      "K-means begin with clusters: 40\n",
      "CH_score: -0.07004268166964643\n",
      "K-means begin with clusters: 50\n",
      "CH_score: -0.03690762157988283\n",
      "K-means begin with clusters: 60\n",
      "CH_score: -0.07361507469833256\n",
      "K-means begin with clusters: 70\n",
      "CH_score: -0.0683961798437001\n",
      "K-means begin with clusters: 80\n",
      "CH_score: -0.06789694129131142\n",
      "K-means begin with clusters: 90\n",
      "CH_score: -0.05559981400932437\n",
      "K-means begin with clusters: 100\n",
      "CH_score: -0.1020813698466801\n"
     ]
    }
   ],
   "source": [
    "# 设置超参数（聚类数目K）搜索范围\n",
    "#event.toarray()\n",
    "#from sklearn import metrics\n",
    "CH_scores = []\n",
    "Ks = [10,20,30,40,50,60,70,80,90,100]\n",
    "for K in Ks:\n",
    "    ch = K_cluster_analysis(K, eventContMatrix)\n",
    "    CH_scores.append(ch)"
   ]
  },
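  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "The scores are printed above. A higher score is better (the value stored in CH_score is the silhouette coefficient returned by metrics.silhouette_score), and K = 20 gives the best result, about 0.11."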
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[-0.03405413715837257, 0.11159483154674518, -0.08527052385421639, -0.07004268166964643, -0.03690762157988283, -0.07361507469833256, -0.0683961798437001, -0.06789694129131142, -0.05559981400932437, -0.1020813698466801]\n"
     ]
    }
   ],
   "source": [
    "print (CH_scores)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<matplotlib.lines.Line2D at 0x1f73a318860>]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAD8CAYAAABzTgP2AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzt3XmUXHWZ//H3Q1bCmhVCFrI12WiqxBb4iQghEcLRMQ47xyUiDuMRBvy5ICjKiMOMjOAOo8gi/lRIQEeigzKIXYDjiOmEkB3SCYE0xBBICIhk7e/vj6fuSVdTne5O3apbVffzOqdPdd2+VfdJpbo+/b3Pvd9rIQREREQiByRdgIiIVBcFg4iIFFAwiIhIAQWDiIgUUDCIiEgBBYOIiBRQMIiISAEFg4iIFFAwiIhIgb5JF7A/hg0bFsaNG5d0GSIiNWXRokUvhxCGd7deTQbDuHHjaGlpSboMEZGaYmbP9WQ97UoSEZECCgYRESmgYBARkQIKBhERKaBgEBGRAgoGEREpoGAQEZECCoYE3HcfvPhi0lWIiBSnYKiwF16A88+Hb34z6UpERIpTMFRYLue3S5YkWoaISJcUDBXWMRhCSLQUEZGiFAwVlsvBAQfAyy/Dxo1JVyMi8lYKhgpqa4PWVpgzx+9rd5KIVCMFQwVFu5GuvNJvn3oqsVJERLqkYKigXA4GD4ZTToFx4zRiEJHqFEswmNlsM3vazFrN7OoiP3+3mS02s91mdm6nn801szX5r7lx1FOtmpvh1FO9x5DNasQgItWp5GAwsz7ALcBZwDTgIjOb1mm154GPAj/r9NghwHXAicAJwHVmNrjUmqrR88/DunVw2ml+P5uFZ56BN95ItCwRkbeIY8RwAtAaQlgXQtgJ3AvM6bhCCGF9CGEp0N7psWcCD4cQtoQQtgIPA7NjqKnqPPqo30bBkMn44arLlydWkohIUXEEwyhgQ4f7bfll5X5sTcnlYMgQaGz0+9ms36rPICLVJo5gsCLLenrqVo8fa2aXmlmLmbVs3ry5x8VVi479BYCjj4bDDlOfQUSqTxzB0AaM6XB/NNDTKeJ6/NgQwm0hhKYQQtPw4cP3q9CkPPccPPvs3t1IAGa+O0kjBhGpNnEEw0KgwczGm1l/4EJgQQ8f+xBwhpkNzjedz8gvqyud+wuRTAaWLoX2zp0XEZEElRwMIYTdwOX4B/oqYH4IYYWZXW9m7wcws3eYWRtwHvADM1uRf+wW4Kt4uCwErs8vqyu5HAwdCsceW7g8m/WjktauTaQsEZGi+sbxJCGEB4EHOy37cofvF+K7iYo99k7gzjjqqFad+wuRTMZvlyyBhobK1yUiUozOfC6z9ev9q/NuJIDp06FPHzWgRaS6KBjKrKv+AsDAgTBlihrQIlJdFAxlFvUXpk8v/nNNjSEi1UbBUGZd9RcimYxPx/3KK5WtS0SkKwqGMlq/3s9hmDGj63WiM6A1ahCRaqFgKKPo+gvF+guRjkcmiYhUAwVDGeVyMGwYTOs812wHI0bAyJEaMYhI9VAwlEkI3fcXItmsRgwiUj0UDGWyfr1fg2Ff/YVIJgOrVsHOnWUvS0SkWwqGMulJfyGSzcKuXbByZTkrEhHpGQVDmfSkvxCJGtDqM4hINVAwlEHUXzjtNJ9euzsNDXDggeoziEh1UDCUwbPPwoYNPesvgM+X1NioEYOIVAcFQxn0pr8QiY5MCj299p2ISJkoGMogl4Phw2Hq1J4/JpOBrVt9pCEikiQFQ8x621+IaGoMEakWCoaYrVvnk+L1tL8QaWz0WzWgRSRpCoaY7U9/AeCQQ2DSJI0YRCR5CoaY5XI+/9GUKb1/rKbGEJFqoGCI0f72FyKZDKxdC6+/HntpIiI9pmCI0dq18MILve8vRKIG9NKl8dUkItJbCoYY7W9/IaKpMUSkGigYYpTLwRFHwOTJ+/f40aNhyBD1GUQkWQqG
mJTaXwB/XCajEYOIJEvBEJPWVnjxxf3vL0SyWVi2DPbsiacuEZHeUjDEpNT+QiSTgTffhDVrSq1IRGT/KBhiksvBkUfCMceU9jzRkUnqM4hIUhQMMYijvxCZOhX69VMwiEhyFAwxWLMGNm4svb8A0L+/X/VNDWgRSYqCIQZx9RcimYxGDCKSHAVDDHI5GDnSL9EZh2wW/vIX2LQpnucTEekNBUOJ4uwvRHRtBhFJkoKhRM8843/dx9FfiGhqDBFJkoKhRHH3F8CnxRgzRn0GEUmGgqFEuRwcdZRfZCdOmhpDRJKiYChBOfoLkWwWVq+G7dvjfV4Rke4oGErw9NN+5FCc/YVIJuPzJa1YEf9zi4jsi4KhBOXoL0Q0NYaIJCWWYDCz2Wb2tJm1mtnVRX4+wMzm5X/+hJmNyy8fZ2ZvmtmS/Nf346inUnI5GDUKJk6M/7knTICDD1afQUQqr2+pT2BmfYBbgPcAbcBCM1sQQljZYbVLgK0hhElmdiFwI3BB/mdrQwjZUuuotBA8GGbNir+/AHDAAXDccRoxiEjlxTFiOAFoDSGsCyHsBO4F5nRaZw5wd/77+4GZZuX4OK2c1avL11+IREcmhVC+bYiIdBZHMIwCNnS435ZfVnSdEMJuYBswNP+z8Wb2pJk9amandLURM7vUzFrMrGXz5s0xlF2acvYXItksvPYarF9fvm2IiHQWRzAU+8u/89+4Xa2zERgbQngb8GngZ2Z2aLGNhBBuCyE0hRCahg8fXlLBccjl/BrNEyaUbxvRGdDanSQilRRHMLQBYzrcHw282NU6ZtYXOAzYEkLYEUJ4BSCEsAhYC5R4qZvyi/oL5Th/oaPGRu81qAEtIpUURzAsBBrMbLyZ9QcuBBZ0WmcBMDf//bnA70MIwcyG55vXmNkEoAFYF0NNZbVqFbz0Unn7CwCDBvkV4TRiEJFKKvmopBDCbjO7HHgI6APcGUJYYWbXAy0hhAXAHcD/M7NWYAseHgDvBq43s93AHuATIYQtpdZUbpXoL0QyGXjiifJvR0QkUnIwAIQQHgQe7LTsyx2+3w6cV+RxPwd+HkcNlZTL+SR348eXf1vZLMybB6++CocfXv7tiYjozOdeqlR/IRI1oJcuLf+2RERAwdBrK1fC5s3l7y9ENDWGiFSagqGXKtlfADjySBg+XEcmiUjlKBh6KZeDsWNh3LjKbM/MRw0aMYhIpSgYeqG9vbL9hUgm49Nv79pVuW2KSHopGHph5Up4+eXK9Rci2Szs2OHXfxARKTcFQy9Uur8QiY5MUp9BRCpBwdALuRwcfXTl+guRyZNhwAD1GUSkMhQMPdSxv1Bp/frB9OkKBhGpDAVDD61YAa+8Uvn+QiSb1bUZRKQyFAw9FPUXTj01me1ns35i3caNyWxfRNJDwdBDuZz3FirdX4ioAS0ilaJg6IEk+wsRXbRHRCpFwdADy5fDli3J9RcADjvMRysaMYhIuSkYeiDp/kJEU2OISCUoGHogl/NrLxx9dLJ1ZDLwzDPwxhvJ1iEi9U3B0I32dnj00WT7C5Fs1g9XXb486UpEpJ4pGLqxbFny/YWIjkwSkUpQMHSjWvoL4M3nQw9Vn0FEykvB0I1cDiZM8GswJM3MRw0aMYhIOSkY9qGa+guRaGqM9vakKxGReqVg2IelS2Hr1uoKhkzGj0paty7pSkSkXikY9iGp6y/sSzbrt+oziEi5KBj2IZeDiRNhzJikK9lr+nTo00fBICLlo2Dowp491ddfABg4EKZMUQNaRMpHwdCFpUvh1VerLxhAU2OISHkpGLpQjf2FSCYDbW1+4SARkbgpGLqQy8GkSTB6dNKVvFXUgNbuJBEpBwVDEdXaX4hoagwRKScFQxFPPQXbtlVvMIwYASNHqs8gIuWhYCiimvsLEU2NISLlomAoorkZGhpg1KikK+laNgsrV8LOnUlXIiL1RsHQyZ498Nhj1T1aAB8x7NoFq1YlXYmI1BsFQydLlsBrr1V/MGhq
DBEpFwVDJ7XQXwDf1XXggeoziEj8FAydNDfDMcfAUUclXcm+9ekDjY0aMYhI/BQMHezeDY8/Xv2jhUg0NUYISVciIvUklmAws9lm9rSZtZrZ1UV+PsDM5uV//oSZjevws2vyy582szPjqGd/1Up/IZLJ+PUi2tqSrkRE6knJwWBmfYBbgLOAacBFZjat02qXAFtDCJOAbwI35h87DbgQmA7MBm7NP18iaqW/EFEDWkTKIY4RwwlAawhhXQhhJ3AvMKfTOnOAu/Pf3w/MNDPLL783hLAjhPAs0Jp/vkQ0N8PkyX5WcS1obPRbNaBFJE5xBMMoYEOH+235ZUXXCSHsBrYBQ3v4WADM7FIzazGzls2bN8dQdqFa6y8AHHKIT/SnEYOIxCmOYLAiyzq3Q7tapyeP9YUh3BZCaAohNA0fPryXJXbvySfh9ddrKxhAU2OISPziCIY2oOPFL0cDL3a1jpn1BQ4DtvTwsRVRa/2FSDYLra0eaiIicYgjGBYCDWY23sz6483kBZ3WWQDMzX9/LvD7EELIL78wf9TSeKAB+HMMNfVac7NfMvPII5PY+v6LpuBetizZOkSkfpQcDPmeweXAQ8AqYH4IYYWZXW9m78+vdgcw1MxagU8DV+cfuwKYD6wEfgtcFkLYU2pNvVWL/YWIjkwSkbj1jeNJQggPAg92WvblDt9vB87r4rE3ADfEUcf+WrwY/vrX2gyG0aNh8GD1GUQkPjrzmdrtLwCY7T0DWkQkDgoGvL8wdSoccUTSleyfTMZ7DHsqvhNOROpR6oNh1y74wx9qc7QQyWbhzTdhzZqkKxGRepD6YKjl/kIkakCrzyAicUh9MNRyfyEydSr066c+g4jEI/XB0NwM06bBiBFJV7L/+vf3f4OCQUTikOpgqIf+QkRTY4hIXFIdDIsWwRtv1EcwZLOwcSO89FLSlYhIrUt1MET9hVNPTbSMWERTY2jUICKlSnUwNDfD9Om13V+IRMGgPoOIlCq1wVBP/QWAoUN9egyNGESkVKkNhpYW+Nvf6icYQFNjiEg8UhsM9dRfiGQysHo1bN+edCUiUstSGwzNzXDssVCGi8ElJpv1+ZJWrEi6EhGpZakMhp074X/+p752I4GOTBKReKQyGOqxvwAwcSIcdJD6DCJSmlQGQz32FwAOOEBnQItI6VIZDM3N0NgIw4YlXUn8MhkfMYSQdCUiUqtSFwz12l+IZLPw2muwfn3SlYhIrUpdMCxc6Be1qddgUANaREqVumCo1/5CpLHRew1qQIvI/kpdMDQ3w3HH+RQS9WjQIGho0IhBRPZfqoJhxw744x/rdzdSRFNjiEgpUhUM9d5fiGQy3nx+9dWkKxGRWpSqYMjlwKx++wuRbNZvly5Ntg4RqU2pCoaovzBkSNKVlJeOTBKRUvRNuoBKGjeu/ncjAYwc6ZMDqs8gIvsjVcFwxx1JV1AZZpoao7dC8P7ToEFJVyKSvFTtSkqTbBaWL4fdu5OupPq99hrMmgVjxsCiRUlXI5I8BUOdymb98Nynn066kur2l7/4wQiPPQb9+8N73gOLFyddlUiyFAx1KmpAq8/QtdZWOPlkeOYZ+NWv4H//Fw45xEcPTz6ZdHUiyVEw1KnJk2HAAPUZurJ4sYfCtm3w+9/D7Nl+cEIutzccFKqSVgqGOtWvH0yfrg+3Yh55xHcfDRzoM+2eeOLen40f74c1H3QQzJyp10/SScFQx6KpMXRthr3mz4ezzvLRwR//6COrziZM8JFDFA4adUnaKBjqWCYDmzd7g1Xgu9+FCy+Ek07yZvOoUV2vO2GCjxwGDVI4SPooGOpYNDVG2neHhADXXgtXXAHvfz889BAMHtz94yZO9JHDgQd6OGiKEUkLBUMdO+44v03zX7u7d8M//APccIPf3n+/f9D31MSJPnIYONDDYdmy8tUqUi1KCgYzG2JmD5vZmvxt0b/DzGxufp01Zja3w/KcmT1tZkvyXyNKqUcKHX6470tP64jhzTfh
nHP8jPcvfQl+8APoux/n+k+a5COHAQPg9NMVDlL/Sh0xXA08EkJoAB7J3y9gZkOA64ATgROA6zoFyAdDCNn810sl1iOdpHVqjK1b4Ywz/PyE730Prr/epwrZX5Mm+cihf38Ph+XL46tVpNqUGgxzgLvz398NfKDIOmcCD4cQtoQQtgIPA7NL3K70UDbrJ3D97W9JV1I5L7wAp5wCf/4zzJsHl10Wz/M2NPjIQeEg9a7UYDgihLARIH9bbFfQKGBDh/tt+WWRu/K7kb5kVsrfdFJMJgPt7en5EFu9Gt75Tnj+efjNb+C88+J9/oYGHzn07evhsGJFvM8vUg26DQYz+52ZLS/yNaeH2yj2YR8dWf/BEEIjcEr+68P7qONSM2sxs5bNmzf3cNOSpiOTnngC3vUu2L7d/7I//fTybOeYY/z5o3BYubI82xFJSrfBEEKYFUI4tsjXA8AmMxsJkL8t1iNoA8Z0uD8aeDH/3C/kb18Hfob3ILqq47YQQlMIoWn48OE9/fel3rhxcOih9d9n+M1v/EP6sMP8xLXjjy/v9o45xkcOffrAjBkKB6kvpe5KWgBERxnNBR4oss5DwBlmNjjfdD4DeMjM+prZMAAz6we8D0jJDo/Kia7NUM8jhh//2M9PmDzZQ2HixMpsd/Jkn2fpgAM8lFatqsx2Rcqt1GD4GvAeM1sDvCd/HzNrMrPbAUIIW4CvAgvzX9fnlw3AA2IpsAR4AfhhifVIEdmsn5zV3p50JfG76SaYOxfe/W7fvXPEEZXd/pQpPnIAHzmsXl3Z7YuUg4UanEinqakptLS0JF1GzbjjDvj4x2HNGj/ssh60t8NVV8HNN8P55/uoYcCA5OpZtcqDwcyDYsqU5GqReITgR7bNnw8LF8IPf1h8bq1aYmaLQghN3a2nM59ToN4a0Lt2+Sjh5pvh8svhnnuSDQWAqVN9t1J7uweELpBUm0LwEPjc57w/d9JJfh7M4sVw9tnw178mXWFlKBhSYPp0b5LWQwP6jTe8n/CTn8C//At85zu+j78aTJvmowWFQ20JwS/p+vnP++SJJ5wA3/42NDbC3XfDpk3wwAO+m/BjH0vHbMVV8isl5TRwoO/aqPURw8sve5P3v//bh/Vf/GJpZzOXw7RpPnLYvdvD4Zlnkq5IignBfx++8AU/N6WpCb7xDf89uesuD4Nf/xo+8hGfWmbmTPi3f4P77vP16t1+zBwjtSiTgccfT7qK/ffcc3DmmX77i1/AnJ6eRZOA6dN95DBjhn/lcv7hI8kKwee5mj/fv9as8ZH0zJlwzTXwgQ/A0KFdP/5zn/Oew1VX+eHQM2ZUrvZK04ghJbJZ2LABtmxJupLeW7bMz2betMlHC9UcCpHp033ksGsXnHaafwhJ5YXgZ/1/+cveB8pk/C//o4+G227za5U89BBccsm+QwF8dHrXXX4OywUX+O9TvVIwpEQm47e11md4/HGf9wj84jrR97Xg2GM9HHbu9L8uW1uTrig9Vq6Ef/5nD+jGRp92fdQo+P73YeNGePhhn4Z92LDePe8hh8B//qefXX/uubBjR1nKT5yCISWiYKilPsMDD/gMqUce6SeuNTYmXVHvReGwY4ePHBQO5fP00/DVr/r7ZPp0n1F3xAi45RZ48UW/1vc//qMvK8WUKfCjH/lupSuuiKX0qqNgSIkjjvAP2FoZMdx+ux8eeNxx8Ic/+NC/VjU2+ofS9u0+cli7NumK6seaNT4ayGT8A/u66/zqfN/9rs+ym8vBJz8Z/4mPZ58NV1/tu6PuvDPe564GOsEtRc46y4fR1TxqCAH+9V/9UpyzZ/sV1w46KOmq4rF0qR9VdeCB/oFVqak76s3atXsbyNF7+eST/UTHc87Z97W847Rnj79HH3/c/3hp6va0seTpBDd5i2zW973u3Jl0JcXt2eND82uvhQ99CBYsqJ9QAB/9PPKIX1luxgxYty7pimrHs8/Cv/+7f/hOmuSHmQ4c6IeOPv+8fzBfcUXl
QgH8iKZ77vHRyDnn+OHU9UKHq6ZIJuNHyaxatbfnUC127PBjxufPh89+Fm68sXpOXItTJuPhcPrp3nPI5fykqmrU3u7nY+zZk9zt66/Df/2Xn40MfvLZTTd547cadi8OGwY//7lP937RRfDb33pg1DoFQ4pEU2M89VR1BcNrr8Hf/703ab/+dQ+GehaFw8yZe89zGD++8nWE4LsW16zxr9bWvbetrdVz1b+mJh8tnHtuMq9Td5qa4NZb/ZDXa6/1w2FrnYIhRRoafP/2kiX+13k1WLbM5z1atswnwvtwl5dqqi/ZLPzudx4Op50Gjz7qc/PELQQ/Iqfjh35XH/79+vnopaHBA2vwYL8YUZ8+yd326weDBsX/usTtYx/zC0V97Wvwjnd4c7qWqfmcMieeCAcf7H+xJmXnTj97+ZZbfN/wwQf7LqSzzkqupqQ8+aSHw6GH+shhf8Ih+vDv/MG/Zo03arv68J80qfB27Nj62A2SlB07fPr3lSt911c1zrDb0+azRgwpk8n4PtEQKj/PUFsb/OAHPs/Rpk3+AXXTTfDRj3Z/1mm9etvbfOQwa9be3UrF9p23t+/d7VPsr/8339y7br9+fsTTpEn+vB0DQB/+5TNggP9uHX+87xr985/9hLhapGBImWzWP5jb2mDMmO7XL1UI3ju49VY/Ya29Hd77Xj+2/Mwz67PB3FvHH1+4W+l733vr7p/OH/79+3uwdv7wb2jw/1d9+Cdj9GiYN8//Ty6+2Cfdq7aJHntCwZAyHafGKGcwbNvmPYNbb/XpiocOhc98Bj7xiepsICYtCodZs+B97/Nl0Yd/Q4Mv77jbRx/+1WvGDG+Wf/azfjDFVVclXVHvKRhS5rjj/HbJkr0fQHFatsx7Bz/5iV874YQTfE7788/3486la29/u79+q1d7AOjDv3Z9+tO+K+maa/z/debMpCvqHQVDyhxyiO9/jnNqjM7N5IED/ZjuT36yNs4GrSajR/uX1DYzv6Tu8uVw4YV+IaCxY5Ouque0hzeFstl4psVoa4Mvfcnf8Bdd5PvFb7rJl995p0JB0u3gg/0Ppp07/czo7duTrqjnFAwplMn4YYyvv977x4bgh7qefbYfWnnDDX7c9oMPeqP0M59J7xFGIp1Nnuy9tpYW+Kd/SrqanlMwpFA2u/dqVj21bZtfX3nqVG+EPvaYh8DatfCrX/k5CDrCSOSt5szxuZ1uv92PCKwF+lVOoY5TY3Rn6VI/kmjUKLjySjjsMG8mt7X5fEY6wkike9df79cWufxyb0pXOzWfU2j0aJ/uoKs+g5rJIvHq0wd+9jP//TnnHG9Gl3rBoHLSiCGFzHzU0HnEsGGDmski5TJ0qJ8Z/fLLfqTS7t1JV9Q1BUNKZTK+m2j37r3N5PHj1UwWKafjj/frTjc3e9+hWmlXUkplsz7FQkMDrF+vM5NFKmXuXJ+J9etf9z/Czjsv6YreSsGQUief7FMbjxgBX/mKzkwWqaRvfctn1r34Ypg+HaZNS7qiQgqGlJo0yc9jUBiIVF7//n4982gm1oULfer1aqEeQ4opFESSM2qUX4dk7VrfvdTennRFeykYREQScuqp3mv45S/9vKBqoWAQEUnQpz7lh69eey08/HDS1TgFg4hIgsx8uoxp0/z8ofXrk65IwSAikriDDvLZBnbt8jOjO16tLwkKBhGRKtDQ4Be4WrwYLrvMJ7pMioJBRKRK/N3f+bQ0d90Ft92WXB0KBhGRKnLddTB7tl+/4U9/SqaGkoLBzIaY2cNmtiZ/O7iL9X5rZq+a2a87LR9vZk/kHz/PzPqXUo+ISK3r0wd++lOfBfncc2HTpsrXUOqI4WrgkRBCA/BI/n4xXwc+XGT5jcA384/fClxSYj0iIjVvyBBvRr/yClxwQeVnYi01GOYAd+e/vxv4QLGVQgiPAAUXkjQzA04H7u/u8SIiaZPNep/h0Ufh85+v7LZLDYYjQggbAfK3vbn0xFDg1RBClIVtwKgS6xER
qRsf/rBf9e0b34B58yq33W4n0TOz3wFHFvnRF0vcthVZ1uUBWmZ2KXApwNixY0vctIhIbbj5Zj+E9ZJLfCbWY48t/za7DYYQwqyufmZmm8xsZAhho5mNBF7qxbZfBg43s775UcNo4MV91HEbcBtAU1NTgkf4iohUTv/+cN998Pa3+wW1Fi70a6+XU6m7khYAc/PfzwUe6OkDQwgBaAbO3Z/Hi4ikxVFH+UysjY2V2Z6FEk6vM7OhwHxgLPA8cF4IYYuZNQGfCCF8PL/e48AU4GDgFeCSEMJDZjYBuBcYAjwJfCiEsKO77TY1NYWWlpb9rltEJI3MbFEIodurt5d0oZ4QwivAzCLLW4CPd7h/ShePXwecUEoNIiISL535LCIiBRQMIiJSQMEgIiIFFAwiIlJAwSAiIgUUDCIiUkDBICIiBUo6wS0pZrYZeC7pOko0DJ8WRPRadKbXo5Bej71KfS2ODiEM726lmgyGemBmLT05AzEN9FoU0utRSK/HXpV6LbQrSURECigYRESkgIIhObclXUAV0WtRSK9HIb0ee1XktVCPQURECmjEICIiBRQMZWZmY8ys2cxWmdkKM7syv3yImT1sZmvyt4OTrrWSzKyPmT1pZr/O3x9vZk/kX495ZtY/6RorwcwON7P7zWx1/j3yf9L83jCz/5v/PVluZveY2cA0vTfM7E4ze8nMlndYVvT9YO47ZtZqZkvN7Pi46lAwlN9u4DMhhKnAScBlZjYNuBp4JITQADySv58mVwKrOty/Efhm/vXYClySSFWV923gtyGEKUAGf01S+d4ws1HAFUBTCOFYoA9wIel6b/wImN1pWVfvh7OAhvzXpcB/xFWEgqHMQggbQwiL89+/jv/ijwLmAHfnV7sb+EAyFVaemY0G3gvcnr9vwOnA/flVUvF6mNmhwLuBOwBCCDtDCK+S4vcGfvGwA82sLzAI2EiK3hshhMeALZ0Wd/V+mAP8OLg/AYeb2cg46lAwVJCZjQPeBjwBHBFC2AgeHsCI5CqruG8BVwHt+ftDgVdDCLvz99vw8Kx3E4DNwF353Wq3m9lBpPS9EUJ4AbgJv0zwRmAbsIh0vjc66ur9MArY0GG92F4bBUOFmNnBwM+BT4UQXku6nqSY2fvJkoulAAABrElEQVSAl0IIizouLrJqGg6X6wscD/xHCOFtwBukZLdRMfl953OA8cBRwEH47pLO0vDe6Imy/d4oGCrAzPrhofDTEMIv8os3RcO+/O1LSdVXYScD7zez9cC9+G6Cb+HD4Oga5KOBF5Mpr6LagLYQwhP5+/fjQZHW98Ys4NkQwuYQwi7gF8A7Sed7o6Ou3g9twJgO68X22igYyiy///wOYFUI4RsdfrQAmJv/fi7wQKVrS0II4ZoQwugQwji8sfj7EMIHgWbg3PxqqXg9Qgh/ATaY2eT8opnASlL63sB3IZ1kZoPyvzfR65G690YnXb0fFgAfyR+ddBKwLdrlVCqd4FZmZvYu4HFgGXv3qX8B7zPMB8bivxDnhRA6N53qmpmdBnw2hPA+M5uAjyCGAE8CHwoh7EiyvkowsyzehO8PrAMuxv9gS+V7w8y+AlyAH833JPBxfL95Kt4bZnYPcBo+i+om4DrglxR5P+TD83v4UUx/Ay4OIbTEUoeCQUREOtKuJBERKaBgEBGRAgoGEREpoGAQEZECCgYRESmgYBARkQIKBhERKaBgEBGRAv8fL2oYBI4FwuAAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x1f73a656710>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# 绘制不同聚类数目的模型的性能，找到最佳模型／参数（分数最高）\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "plt.plot(Ks, np.array(CH_scores), 'b-')"
   ]
  },
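  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "Plotted as a chart, the best score is only about 0.11. A silhouette coefficient this close to 0 indicates overlapping clusters, so the clustering result is not satisfactory."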
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Assignment summary"
   ]
  },
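  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "Content summary:\n",
    "1) First modified 1. Users_Events to extract the required data\n",
    "2) Clustered the extracted data following the steps of the course example\n",
    "3) Because the number of samples is large, used MiniBatchKMeans for clustering, as suggested\n",
    "4) After several weeks of study, my problem-solving ability has improved and I have a basic understanding of machine learning\n",
    "\n",
    "Open issues:\n",
    "1) The use cases of other clustering methods are still unclear to me and need follow-up practice\n",
    "2) I know little about the dict type; most errors during the assignment came from mismatched data types\n",
    "3) Some theoretical points are still unclear and need repeated study for a deeper understanding\n",
    "4) A b'' type error came up along the way; I did not find a good fix afterwards and still do not understand the cause\n"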
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
